knitr::opts_chunk$set(message = FALSE, warning = FALSE)

Now that we have our data in a nicely formatted file, we can move to R and follow a fairly standard workflow. Along the way, we’ll come across a few useful packages and data structures which will be applicable in many other contexts.

The packages we’ll use for this include not only the tidyverse, but a few other packages which are hosted on Bioconductor. We’ll also add the magrittr and scales packages, along with pander and ggrepel, as they contain some useful additional utilities.

library(limma)
library(edgeR)
library(AnnotationHub)
library(tidyverse)
library(magrittr)
library(scales)
library(pander)
library(ggrepel)

Data Setup

Import

First we should import the file we’ve created using featureCounts.

counts <- read_tsv("../2_alignedData/counts/genes.out")

This file has the gene identifiers in the first column, with the remaining columns containing the name of the bam file and the number of reads aligning to each gene. The first thing we might like to do is tidy up those column names.

colnames(counts) <- colnames(counts) %>%
    basename() %>%
    str_remove("_(10|23)_(03|04)_(2014|2016)_S[0-9]_fem_Aligned.+")

That looks much cleaner without losing any important information.

Create a DGE List

The main object type we like to use for differential gene expression is a DGEList, which stands for Digital Gene Expression List. These objects have two mandatory elements, with the first being our counts and the second being our samples. In these objects, we can consider the gene IDs to be the rownames and the sample names are the column names. Here’s one way to create a DGEList.

dgeList <- counts %>%
    as.data.frame() %>%
    column_to_rownames("Geneid") %>%
    DGEList() %>%
    calcNormFactors()

In the first three lines, we’re just setting the gene IDs as the rownames instead of leaving them as a column. From there we’ve turned the data into a DGEList object and calculated the normalisation factors, which can be seen in dgeList$samples. If fitting counts using a negative binomial model (for discrete data), these factors adjust the model based on the library size and count distributions. We’ll use a different approach for our analysis, but calculating them as we form the object is still good practice.

dgeList$samples
##                               group lib.size norm.factors
## 5_non_mutant_Q96_K97del_24mth     1   480834    1.0319290
## 2_non_mutant_Q96_K97del_6mth      1   858817    0.9141733
## 3_non_mutant_Q96_K97del_6mth      1   888102    0.9938903
## 4_non_mutant_Q96_K97del_24mth     1   416735    1.0588504
## 7_non_mutant_Q96_K97del_24mth     1   458429    1.0604135
## 6_non_mutant_Q96_K97del_24mth     1   573014    1.0352478
## 1_non_mutant_Q96_K97del_6mth      1   896454    0.9175484
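
As a point of reference, edgeR combines these two columns when computing normalised values: the effective library size is simply the product of the raw library size and the normalisation factor. A minimal sketch using the values shown for the first two samples above:

```r
# Effective library size = raw library size x normalisation factor.
# These values are copied from the first two rows of dgeList$samples above.
libSize <- c(480834, 858817)
normFactors <- c(1.0319290, 0.9141733)
effLibSize <- libSize * normFactors
round(effLibSize)
```

A factor above 1 inflates the effective library size, while a factor below 1 shrinks it.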

In the samples element, we can set the group variable, so let’s put our timepoints here.

dgeList$samples$group <- colnames(dgeList) %>%
    str_extract("(6|24)mth") %>%
    factor(levels = c("6mth", "24mth"))

Add Gene Information

ah <- AnnotationHub() %>%
    subset(species == "Danio rerio") %>%
    subset(dataprovider == "Ensembl") %>%
    subset(rdataclass == "EnsDb")
ensDb <- ah[["AH64906"]]
genes <- genes(ensDb) %>%
    subset(seqnames == 2)
mcols(genes) <- mcols(genes)[c("gene_id", "gene_name", "gene_biotype", "entrezid")]
dgeList$genes <- genes[rownames(dgeList),]

Data QC

Undetectable genes

As you may have noticed, no reads aligned to the 4th gene in this dataset, so we can remove it from the data. There are probably many more genes in the same boat. Let’s run a logical test to see how many genes were not detected in our dataset.

dgeList$counts %>% 
    rowSums() %>%
    is_greater_than(0) %>%
    table()
## .
## FALSE  TRUE 
##   319  1302

A common approach is to remove undetectable genes using some metric, such as Counts Per Million reads, known as CPM. We could consider a gene detectable if it returns more than 1 CPM in every sample from at least one treatment group. Although our dataset is small (all libraries are < 1e6 reads), we usually deal with libraries of between 20 and 30 million reads, where 1 CPM would equate to 20-30 reads aligning to a gene in every sample from a treatment group. Here our smallest group contains 3 samples, so let’s see what would happen if we applied that filter.

dgeList %>%
    cpm() %>%
    is_greater_than(1) %>%
    rowSums() %>%
    is_greater_than(3) %>%
    table()
## .
## FALSE  TRUE 
##   481  1140
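
To demystify the CPM values used in that filter, here’s a base-R sketch of the calculation on a toy matrix. Note that edgeR’s cpm() uses the effective library size (incorporating the normalisation factors), so real values will differ slightly from this simplified version.

```r
# CPM: scale each column of counts to its library size, in millions
cpmManual <- function(counts, libSize) t(t(counts) / libSize * 1e6)
toyCounts <- matrix(
    c(10, 0, 5, 20), nrow = 2,
    dimnames = list(c("geneA", "geneB"), c("sample1", "sample2"))
)
cpmManual(toyCounts, colSums(toyCounts))
```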

Losing about 1/3 of the genes is pretty common, so let’s now apply that filter to our dataset.

genes2keep <- dgeList %>%
    cpm() %>%
    is_greater_than(1) %>%
    rowSums() %>%
    is_greater_than(3)
dgeFilt <- dgeList[genes2keep,] %>% calcNormFactors()

Let’s compare the distributions of the two datasets. Note the peak at the left of the first plot, around zero: this is all of the genes with near-zero counts. That peak is missing from the second plot, confirming that we have removed most of the undetectable genes.

par(mfrow = c(1,2))
dgeList %>%
    cpm(log = TRUE) %>%
    plotDensities(legend = FALSE, main = "A. Before Filtering")
dgeFilt %>%
    cpm(log = TRUE) %>%
    plotDensities(legend = FALSE, main = "B. After Filtering")
Comparison of logCPM distributions before and after filtering for undetectable genes. Values on the x-axis represent logCPM.

par(mfrow = c(1,1))

Library Sizes

Let’s check our library sizes. It does appear that having the libraries prepared on different days has given one of our groups more reads. This is not ideal, but using a normalised value such as CPM largely avoids any issues this introduces.

dgeFilt$samples %>%
    ggplot(aes(group, lib.size, fill = group)) +
    geom_boxplot() +
    scale_y_continuous(labels = comma) +
    labs(x = "Timepoint", y = "Library Size") +
    theme_bw() 
Library Sizes after filtering for undetectable genes.

PCA

Next we might choose to perform a Principal Component Analysis on our data, commonly abbreviated to PCA. This time, let’s take our CPM values and assess them on the log2 scale, to make sure the results are not heavily skewed by highly expressed genes.

pca <- dgeFilt %>%
    cpm(log = TRUE) %>%
    t() %>%
    prcomp() 

In our DGEList, the genes are the variables of interest for our main analysis; for the PCA, however, we’re interested in how the samples relate to each other. The third line in the above code chunk has transposed the matrix returned by cpm() to place the samples in the rows, which is where prcomp() expects to see the observations.
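
As a standalone illustration of why the transpose matters, using a random toy matrix rather than our data:

```r
# prcomp() expects observations in rows and variables in columns,
# so a genes-by-samples matrix must be transposed first
set.seed(42)
toyMat <- matrix(rnorm(20), nrow = 4)  # 4 genes (rows) x 5 samples (columns)
toyPca <- prcomp(t(toyMat))            # now the 5 samples are the observations
nrow(toyPca$x)                         # one row of component scores per sample
```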

A quick inspection of the results shows that the first two components capture most of the variability, as expected. Beyond this observation, the details of PCA are beyond what we can cover here.

summary(pca)$importance %>% pander(split.tables = Inf)
                           PC1     PC2     PC3     PC4     PC5      PC6       PC7
Standard deviation       10.88    8.17   5.864   5.683   5.197    4.838  1.44e-14
Proportion of Variance  0.3917  0.2209  0.1138  0.1069  0.08936  0.07746        0
Cumulative Proportion   0.3917  0.6125  0.7263  0.8332  0.9225        1         1
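
The rows of this table are directly related: the proportion of variance is each squared standard deviation over the total variance, and the cumulative proportion is its running sum. A quick sketch using the standard deviations reported above (PC7 is effectively zero and is omitted):

```r
# Proportion of variance = sdev^2 / sum(sdev^2); cumulative = running sum
sdev <- c(10.88, 8.17, 5.864, 5.683, 5.197, 4.838)  # values from the table
propVar <- sdev^2 / sum(sdev^2)
cumsum(propVar)
```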

We can also plot our results to see if samples group clearly with the treatment group based on our main two principal components. Any clear separation can be considered a positive sign that we will find differentially expressed genes.

plotly::ggplotly(
    pca$x %>%
        as.data.frame() %>%
        rownames_to_column("sample") %>%
        as_tibble() %>%
        dplyr::select(sample, PC1, PC2) %>%
        left_join(rownames_to_column(dgeFilt$samples, "sample")) %>%
        ggplot(aes(PC1, PC2, colour = group, label = sample)) +
        geom_point(size = 3) +
        theme_bw()
)

PCA showing two clear groups in the data

Differential Expression

In the above, we have read counts, which are discrete values and formally cannot be modelled under the assumption of normally distributed data. This rules out linear models and t-tests, so many packages have been developed which use the negative binomial distribution to model these counts. An alternative was proposed by Law et al, where a system of weights is applied to the counts which allows the assumption of normality to hold. This method is called voom, and we’ll use it today.

voomData <- voom(dgeFilt)

Note that this has added a design matrix to the data, and we can use this to perform a simple linear regression on each gene, which amounts to a t-test in this dataset. From here it’s a simple matter to code the analysis and inspect results.
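
To see what that design matrix looks like, here’s a sketch built from a toy group factor matching our timepoints (3 samples at 6mth, 4 at 24mth, as in our data):

```r
# With an intercept model, the first column represents the 6mth baseline
# and the second column is the additional effect of the 24mth timepoint
grp <- factor(
    c("6mth", "6mth", "6mth", "24mth", "24mth", "24mth", "24mth"),
    levels = c("6mth", "24mth")
)
model.matrix(~grp)
```

The coefficient for the second column (named after the non-baseline level) is what we test below via coef = "group24mth".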

topTable <- voomData %>% 
    lmFit() %>%
    eBayes() %>%
    topTable(coef = "group24mth", n = Inf) %>%
    as_tibble()
topTable <- topTable %>%
    unite("Range", ID.start, ID.end, sep = "-") %>%
    unite("Location", ID.seqnames, Range, ID.strand, sep = ":") %>%
    dplyr::select(Geneid = ID.gene_id, 
                  Symbol = ID.gene_name,
                  AveExpr, logFC, t, P.Value, 
                  FDR = adj.P.Val, 
                  Location, 
                  Entrez = ID.entrezid)
topTable %>%
    mutate(DE = FDR < 0.05) %>%
    ggplot(aes(logFC, -log10(P.Value), colour = DE)) +
    geom_point() +
    geom_text_repel(data = . %>% 
                        dplyr::filter(DE) %>%
                        dplyr::filter(-log10(P.Value) > 4 | abs(logFC) > 2.5),
                    aes(label = Symbol)) + 
    scale_colour_manual(values = c("grey", "red")) +
    theme_bw() +
    theme(legend.position = "none")

topTable %>%
    mutate(DE = FDR < 0.05) %>%
    arrange(desc(P.Value)) %>%
    ggplot(aes(AveExpr, logFC, colour = DE)) +
    geom_point(alpha = 0.5) +
    geom_text_repel(data = . %>% 
                        dplyr::filter(DE) %>%
                        dplyr::filter(abs(logFC) > 2 | AveExpr > 14),
                    aes(label = Symbol)) + 
    scale_colour_manual(values = c("grey", "red")) +
    labs(x = "Average Expression (log2 CPM)",
         y = "log Fold-Change") +
    theme_bw() +
    theme(legend.position = "none")

GO Enrichment

Bioconductor doesn’t maintain a mapping of GO terms to EntrezGene identifiers for Danio rerio, so instead we’ll load a pre-prepared file mapping our Ensembl gene IDs to EntrezGene identifiers before testing for enrichment with goana().

ens2Entrez <- file.path("https://uofabioinformaticshub.github.io/Intro-NGS-fib", "data", "ens2Entrez.tsv") %>% 
    url() %>%
    read_tsv()
de <- topTable %>%
    dplyr::filter(FDR < 0.05) %>%
    dplyr::select(Geneid) %>%
    left_join(ens2Entrez) %>%
    dplyr::filter(!is.na(Entrez)) %>%
    .[["Entrez"]] %>%
    unique()
uv <- topTable %>%
    dplyr::select(Geneid) %>%
    left_join(ens2Entrez) %>%
    dplyr::filter(!is.na(Entrez)) %>%
    .[["Entrez"]] %>%
    unique()
goResults <- goana.default(de, uv, "Hs")
head(goResults)
##                                                                          Term
## GO:0001932                              regulation of protein phosphorylation
## GO:0001934                     positive regulation of protein phosphorylation
## GO:0002252                                            immune effector process
## GO:0002253                                      activation of immune response
## GO:0002376                                              immune system process
## GO:0002429 immune response-activating cell surface receptor signaling pathway
##            Ont   N DE      P.DE
## GO:0001932  BP  65  8 0.7744395
## GO:0001934  BP  49  6 0.7589056
## GO:0002252  BP  49  8 0.4410888
## GO:0002253  BP  29  5 0.4321118
## GO:0002376  BP 131 23 0.1987998
## GO:0002429  BP  18  2 0.7718615
goResults %>% 
    rownames_to_column("go_id") %>%
    as_tibble() %>%
    dplyr::filter(DE > 1) %>%
    arrange(P.DE) %>%
    mutate(FDR = p.adjust(P.DE, "fdr")) %>%
    dplyr::filter(FDR < 0.05) %>%
    pander(caption = "GO Terms considered as enriched in the set of differentially expressed genes")
Table: GO Terms considered as enriched in the set of differentially expressed genes

     go_id               Term  Ont   N  DE      P.DE      FDR
GO:0016887    ATPase activity   MF  27  13  2.84e-05  0.04405
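
The FDR column comes from the Benjamini-Hochberg adjustment applied via p.adjust() in the chunk above. A minimal sketch of how it behaves on toy p-values:

```r
# BH/fdr adjustment: p-values are ranked, each is scaled by n / rank,
# and the results are made monotonic from the largest rank downwards
pvals <- c(0.001, 0.010, 0.020, 0.800)
p.adjust(pvals, method = "fdr")
```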